12 Data Manipulation (2.4)

12.1 Learning Outcomes

By the end of this tutorial, you should:

be confident in cleaning datasets, preparing them for analysis
understand how to manipulate variables, and create new variables, in preparation for analysis

12.2 Introduction

A lot of the data you will deal with in sport is ‘messy’. That is, it may not land on your screen in a format that is immediately ready for analysis.

Therefore, some of the key steps we need to take include:

making sure we can read and write the data files
making sure the variables are named consistently and accurately
dealing with missing data
dealing with outliers
making sure each variable type is defined correctly
we have all the variables that we need to conduct our analysis

The following sections describe how a number of these steps can be achieved in R. From my experience, this stage is likely to be the most time-consuming (and frustrating) part of any data analysis, so it’s worth making sure you fully understand the commands and processes outlined below.

12.3 The ‘tidyverse’ package

‘tidyverse’ is a collection of R packages designed for data manipulation, exploration, and visualization.

Two of the core packages within tidyverse, ‘readr’ and ‘dplyr’, provide useful functions for reading and writing data in various formats.

We encountered one part of the tidyverse package, ‘tibbles’, Section 9.8. However, there are a lot of other functions within this package that make it well worth learning.

12.4 Reading and Writing Data using ‘tidyverse’

12.4.1 Reading Data - ‘readr’

‘readr’ is a package within Tidyverse that provides functions to read data from common file formats, including CSV, TSV, and fixed-width files. It’s designed for fast and efficient data reading with user-friendly parsing and type conversion.

To read data from a CSV file, you can use the ‘read_csv()’ function as follows:

# Install and load tidyverse
install.packages("tidyverse") # only needed if you've not already installed tidyverse
library(code)

# Import a .csv file
data <- read_csv("data.csv")

For other file formats, we can use the corresponding ‘tidyverse’ read functions, such as ‘read_tsv()’ for TSV files or ‘read_fwf()’ for fixed-width files.

12.4.2 Writing Data - ‘readr’

‘readr’ also provides functions for writing data to common file formats, like CSV and TSV. For example, the ‘write_csv()’ function can be used to write a data frame to a CSV file:

# Write data to a CSV file
write_csv(data, "output.csv")

Similarly, use the ‘write_tsv()’ function to write data to a TSV file.

12.5 Data cleaning and transformation

‘dplyr’ is another core package within tidyverse that provides a set of tools for data manipulation, such as filtering, selecting, and summarising data.

The following procedures are some of the most common ones you’ll want to use when preparing your data prior to analysis.

12.5.1 Filtering rows with a specific value, and selecting specific columns

# Load the required packages
library(tidyverse)

# create a sample tibble
data <- tibble(
 id = 1:5,
 category = c("A", "B", "A", "B", "A"),
 value = c(23, 45, 12, 78, 37)

)

# Filter rows with category 'A' and select columns 'id' and 'value'

filtered_data <- data %>%
 filter(category == "A") %>%
 select(id, value)
print(filtered_data)

12.5.2 Selecting specific columns

# Select columns by name

selected_data <- data %>%
 select(column1, column2, column3)

12.5.3 Sorting your data by column values


# Sort data in ascending order

sorted_data <- data %>%
 arrange(column_name)

# Sort data in descending order

sorted_data <- data %>%
 arrange(desc(column_name))

12.5.4 Creating new columns

# Add a new column with calculated values

new_data <- data %>%
 mutate(new_column = column1 \* column2)

12.5.5 Grouping data, and performing aggregations

# Group data by a column and calculate the mean of another column

grouped_data <- data %>%
 group_by(column1) %>%
 summarize(mean_value = mean(column2))

12.5.6 Combining reading, writing, and manipulating Data

You can chain together reading, manipulating, and writing data using the ‘%>%’ pipe operator from ‘tidyverse’.

This allows for a more readable and efficient workflow.

In the following example, we read data from a CSV file, filter it, and then write it back to a new CSV file.

# Read data from a CSV file, filter, and write to a new CSV file

read_csv("data.csv") %>%
 filter(category == "A") %>%
 select(id, value) %>%
 write_csv("filtered_data.csv")

12.6 Data reshaping with ‘tidyr’

‘tidyr’ is another core tidyverse package that provides functions for cleaning and reshaping data. It helps create “tidy” data, where each variable is a column, and each observation is a row.

12.7 Gather multiple columns into key-value pairs (wide to long format)

# Gather columns 'column1', 'column2', and 'column3' into key-value pairs

long_data <- data %>%
gather(key = "variable", value = "value", column1, column2, column3)

12.8 Spread key-value pairs into separate columns (long to wide format)

# Spread key-value pairs in 'variable' and 'value' columns into separate columns

wide_data <- data %>%
spread(key = variable, value = value)

12.9 Separate a single column into multiple columns

# Separate 'column_name' into two new columns 'column1'

and 'column2', splitting by a delimiter (e.g., '-')
separated_data <- data %>%
 separate(column_name, into = c("column1", "column2"), sep = "-")

12.9.1 Unite multiple columns into a single column

# Unite columns into a new column 'column_name', with delimiter (e.g., '-')

united_data <- data %>%
 unite(column_name, column1, column2, sep = "-")

12.10 Alternative Approaches

In the preceding sections, we used the ‘tidyverse’ package to conduct various operations on a newly-imported dataset.

Before ‘tidyverse’ was introduced, it was also possible to conduct the same operations, using the following code.

You may still find these approaches useful.

12.10.1 Reading a .csv file

# load library
library(dplyr)

# import data

data <- read.csv('/Users/directory/filename.csv')
head(data) # display the first six rows
str(data)  # inspect variable types

12.10.2 Remove outliers (‘999’)}

# Check each row if 999 is present and, if so, replace with 'NA'

data[data == 999, ]

12.10.3 Removing rows with missing data

data01 <- na.omit(data) # removes any row (observation) with missing data

12.10.4 Remove a variable ‘X’ from dataset ‘data’

data<- subset(data, select = -c(X))

12.10.5 Rename variables in dataset ‘data’ and create a new dataset ‘data02’

data02 <-rename(data, var_01 = A, var_02 = B) # create the new dataset

rm(data) # remove original dataset from memory

12.10.6 Change variable types

data02$game_id = as.factor(data02\$game_id) # make game_id a factor

data02$opposition_id = as.factor(data02\$opposition_id) # make opposition_id a factor

12.10.7 Create a logical true/false variable

In our current dataset, a home game is indicated as either ‘Yes’ or ‘No’. We want to transform this variable into a logical type, which in R means using TRUE or FALSE.

# first, we replace the existing values with TRUE or FALSE

data02$home_game[data02$home_game == 'Yes'] <- TRUE
data02$home_game[data02$home_game == 'No'] <- FALSE

# then, we change the variable type to logical
data02$home_game = as.logical(data02$home_game)

# finally, we check out dataset and make sure the variable type has changed
head(data02) # show first six rows
str(data02) # inspect variable types

12.10.8 Calculate a new variable based on two existing variables

Often, you will want to create new variables based on the existing data within your dataset. This can be done as follows (assuming both variables are of the same type):

data02$var_03 <- data02$var_02-data02$var_01

12.11 Activity: Importing, cleaning, and transforming a dataset

Note: all of the following steps should be conducted in a single R Script.

url <- "https://www.dropbox.com/scl/fi/w6cmx5fgl6y5e1sizeskf/t08_data_b1700_01.csv?rlkey=trab5xa3hpqhf01ruhx1w2ers&dl=1"
data <- read.csv(url)
rm(url)

Working on the dataframe ‘data’, and with the dplyr library installed and loaded:

Use the ‘head’, ‘str’ and ‘summary’ commands to overview the dataset.
In this dataset, ‘999’ represents an outlier. Replace these values with ‘NA’.
Remove any observations with missing data (‘NA’).
Remove the variable ‘X’ from the dataset.
Rename each of the variables to the format ‘var_id’, ‘var_a’ etc.
Change the variable types as follows: var_id = factor, var_a = factor.
Use the ‘str’ function to check variable types.
Calculate a new variable, ‘var_e’, which is the sum of var_c + var_d.
Write the dataset as a .csv file to an appropriate location where you might be able to access it at a later date.

Check your answers:

Show the answer

# Overview the dataset
head(data)

                  id       a     b  c  d         X
1      Winnipeg Jets   David  TRUE 46 16  27.73578
2     Minnesota Wild William FALSE 35 63  62.23405
3 New York Islanders Michael  TRUE 13  2  34.34055
4  Los Angeles Kings Michael FALSE 80 44 999.00000
5      Boston Bruins   James  TRUE 28 16  55.25340
6       Dallas Stars  Robert  TRUE 97 20  24.44302

Show the answer

str(data)

'data.frame':   1000 obs. of  6 variables:
 $ id: chr  "Winnipeg Jets" "Minnesota Wild" "New York Islanders" "Los Angeles Kings" ...
 $ a : chr  "David" "William" "Michael" "Michael" ...
 $ b : logi  TRUE FALSE TRUE FALSE TRUE TRUE ...
 $ c : int  46 35 13 80 28 97 999 8 51 68 ...
 $ d : int  16 63 2 44 16 20 65 17 9 38 ...
 $ X : num  27.7 62.2 34.3 999 55.3 ...

Show the answer

summary(data)

      id                 a                 b                 c         
 Length:1000        Length:1000        Mode :logical   Min.   :  1.00  
 Class :character   Class :character   FALSE:482       1st Qu.: 28.00  
 Mode  :character   Mode  :character   TRUE :518       Median : 50.00  
                                                       Mean   : 60.25  
                                                       3rd Qu.: 76.00  
                                                       Max.   :999.00  
                                                       NA's   :9       
       d                X          
 Min.   :  1.00   Min.   :  3.064  
 1st Qu.: 24.00   1st Qu.: 40.913  
 Median : 49.00   Median : 49.647  
 Mean   : 57.71   Mean   : 69.190  
 3rd Qu.: 75.00   3rd Qu.: 60.380  
 Max.   :999.00   Max.   :999.000  
 NA's   :8        NA's   :10

Show the answer

# Replace 999 with NA in the dataset
data[data == 999] <- NA

# Remove observations with missing data
data_clean <- na.omit(data)

# Remove variable 'X'
data_clean$X <- NULL

# Rename variables
names(data_clean) <- c("var_id", "var_a", "var_b", "var_c", "var_d")

# Change variable types
data_clean$var_id <- as.factor(data_clean$var_id)
data_clean$var_a <- as.factor(data_clean$var_a)

# Use the 'str' function to check variable types
str(data_clean)

'data.frame':   823 obs. of  5 variables:
 $ var_id: Factor w/ 31 levels "Anaheim Ducks",..: 31 15 19 3 10 22 11 5 20 14 ...
 $ var_a : Factor w/ 7 levels "David","James",..: 1 7 5 2 6 6 7 6 6 5 ...
 $ var_b : logi  TRUE FALSE TRUE TRUE TRUE FALSE ...
 $ var_c : int  46 35 13 28 97 8 51 68 43 56 ...
 $ var_d : int  16 63 2 16 20 17 9 38 32 89 ...
 - attr(*, "na.action")= 'omit' Named int [1:177] 4 7 19 29 34 41 45 49 55 56 ...
  ..- attr(*, "names")= chr [1:177] "4" "7" "19" "29" ...

Show the answer

# Calculate 'var_e' as the sum of 'var_c' + 'var_d'
data_clean$var_e <- data_clean$var_c + data_clean$var_d